Part I: Univariate Data Analysis (NBA Data)

Univariate is pretty much what it sounds like: one variable. When undertaking univariate data analysis, we need first and foremost to figure out what type of variable it is that we’re working with. Once we do that, we can choose the appropriate use of the variable, either as an outcome or as a possible predictor.

Motivating Question

We’ll be working with data from every NBA player who was active during the 2018-19 season.

Here’s the data:

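library(tidyverse) ## loads dplyr, ggplot2, readr, stringr and the rest of the tidyverse
nba<-read_rds("https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/nba_players_2018.Rds") ## read the data straight from the course repository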

This data contains the following variables:

Codebook for NBA Data

Name                Definition
namePlayer          Player name
idPlayer            Unique player id
slugSeason          Season start and end
numberPlayerSeason  Which season for this player
isRookie            Rookie season, true or false
slugTeam            Team short name
idTeam              Unique team id
gp                  Games played
gs                  Games started
fgm                 Field goals made
fga                 Field goals attempted
pctFG               Percent of field goals made
fg3m                3 point field goals made
fg3a                3 point field goals attempted
pctFG3              Percent of 3 point field goals made
pctFT               Free throw percentage
fg2m                2 point field goals made
fg2a                2 point field goals attempted
pctFG2              Percent of 2 point field goals made
agePlayer           Player age
minutes             Minutes played
ftm                 Free throws made
fta                 Free throws attempted
oreb                Offensive rebounds
dreb                Defensive rebounds
treb                Total rebounds
ast                 Assists
blk                 Blocks
tov                 Turnovers
pf                  Personal fouls
pts                 Total points
urlNBAAPI           Source url

We’re interested in the following questions:

To answer these questions we need to look at the following variables:

We’re going to go through a pretty standard set of steps for each variable. First, examine some cases. Second, based on our examination, we’ll try either a plot or a table. Once we’ve seen the plot or the table, we’ll think a bit about ordering, and then choose an appropriate measure of central tendency, and maybe variation.

Types of Variables

It’s really important to understand the types of variables you’re working with. Many times analysts are indifferent to this step, particularly with larger datasets. This can lead to a great deal of confusion down the road. Below are the variable types we’ll be working with this semester and the definition of each.

Continuous Variables

A continuous variable can theoretically be subdivided at any arbitrarily small measure and can still be identified. You may have encountered further subdivision of continuous variables into “interval” or “ratio” data in other classes. We RARELY use these distinctions in practice. The distinction between a continuous and a categorical variable is hugely consequential, but the distinction between interval and ratio is not really all that important in practice.

The mean is the most widely used measure of central tendency for a continuous variable. If the distribution of the variable isn’t very symmetric or there are large outliers, then the median is a much better measure of central tendency.
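To make the point about outliers concrete, here is a minimal sketch on made-up numbers (the vector pts is hypothetical and has nothing to do with the NBA data). One very large value drags the mean well away from the typical case, while the median stays put.

```r
# Hypothetical toy data: points for seven players, one of them a star
pts <- c(4, 6, 7, 8, 9, 10, 250)

mean_pts   <- mean(pts)    # pulled upward by the single large value
median_pts <- median(pts)  # the middle value, unaffected by the outlier

print(c(mean = mean_pts, median = median_pts))
# mean = 42, median = 8
```

Six of the seven values are at or below 10, so the median of 8 describes a typical case far better than the mean of 42.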

Categorical Variables

A categorical variable divides the sample up into a set of mutually exclusive and exhaustive categories. Mutually exclusive means that each case can belong to only one category, and exhaustive means that the categories cover every possible option. Categorical is sort of the “top” level classification for variables of this type. Within the broad classification of categorical there are multiple types of other variables.

Categorical: ordered

An ordered categorical variable has, you guessed it, some kind of sensible order that can be applied. For instance, the educational attainment of an individual (high school diploma, associate’s degree, bachelor’s degree, graduate degree) is an ordered categorical variable.

Ordered categorical variables should be arranged in the order of the variable, with proportions or percentages associated with each order. The mode, or the category with the highest proportion, is a reasonable measure of central tendency, but with fewer than ten categories the analyst should generally just show the proportion in each category.

Categorical: ordered, binary

An ordered binary variable has just two levels, but can be ordered. For instance, is a bird undertaking its first migration: yes or no? A “no” means that the bird has undertaken more than one.

The mean of a binary variable is exactly the same thing as the proportion of the sample with that characteristic. So, the mean of a binary variable for “first migration” where 1=“yes” will give the proportion of birds migrating for the first time.

An ordered binary variable coded as 0 or 1 can be summarized using the mean, which is the same thing as the proportion of the sample with that characteristic.
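A quick sketch of why the mean and the proportion coincide, using a made-up vector for ten birds (first_migration is a hypothetical name, with 1 meaning “yes”):

```r
# Hypothetical toy data: 1 = first migration, 0 = not the first
first_migration <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# Proportion computed by hand: number of 1s divided by number of cases
prop_yes <- sum(first_migration == 1) / length(first_migration)

# The mean adds up the 1s and divides by the number of cases: same arithmetic
mean_yes <- mean(first_migration)

print(c(proportion = prop_yes, mean = mean_yes))
```

Both come out to 0.3, because summing a 0/1 variable is just counting the 1s.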

Categorical: unordered

An unordered categorical variable has no sensible ordering that can be applied. Think about something like college major. There’s no “number” we might apply to philosophy that has any meaningful distance from a number we might apply to chemical engineering.

Unlike an ordered variable, an unordered categorical variable should be ordered in terms of the proportions falling into each of the categories. As with an ordered variable, it’s best just to show the proportions in each category for variables with fewer than ten levels. The mode is a reasonable single-number summary of an unordered categorical variable.
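As a rough sketch of what “ordered in terms of the proportions” looks like in practice, here is a made-up tibble of six students (the data and the name students are hypothetical). The sort = TRUE argument to count() puts the largest category first.

```r
library(dplyr)

# Hypothetical toy data: one row per student
students <- tibble(
  major = c("Philosophy", "Chemical Engineering", "History",
            "Philosophy", "History", "History")
)

major_props <- students %>%
  count(major, name = "total_in_group", sort = TRUE) %>%   # largest group first
  mutate(proportion = total_in_group / sum(total_in_group))

print(major_props)
```

The first row of the result is the mode, so the table gives both the full breakdown and the single-number summary at once.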

Categorical: unordered, binary

This kind of variable has no particular order, but can be just binary. A “1” means that the case has that characteristic, a “0” means the case does not have that characteristic. For instance, whether a tree is deciduous or not.

An unordered binary variable coded as 0 or 1 can also be summarized by the mean, which is the same thing as the proportion of the sample with that characteristic.

Formats for categorical variables

In R, categorical variables CAN be stored as text, numbers or even logicals. Don’t count on the data to help you out– you as the analyst need to figure this out.

Factors

We probably need to talk about factors. In R, a factor is a way of storing categorical variables. The factor provides additional information, including an ordering of the variable and a number assigned to each “level” of the factor. A categorical variable is a general term that’s understood across statistics. A factor variable is a specific R term. Most of the time it’s best not to have a categorical variable structured as a factor unless you know you want it to be a factor. More on this later . . .
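Here is a small sketch of what a factor adds, using a made-up character vector of education levels (the values and level names are hypothetical). Stored as text, R can only sort the values alphabetically; stored as a factor, the variable carries the order we give it, plus an integer code for each level.

```r
# Hypothetical toy data stored as plain text
education <- c("bachelor's", "high school", "graduate",
               "high school", "associate's")

# As text, the only available ordering is alphabetical
sort(unique(education))

# As a factor, we supply the sensible order ourselves
education_f <- factor(
  education,
  levels = c("high school", "associate's", "bachelor's", "graduate")
)

levels(education_f)      # the levels, in the order we specified
as.integer(education_f)  # the number assigned to each case's level
table(education_f)       # counts, reported in level order
```

The integer codes are the reason for the caution above: once a variable is a factor, some functions will quietly work with the codes rather than the labels.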

The Process: #TrustTheProcess

I’m going to walk you through how an analyst might typically decide what type of variables they’re working with. It generally works like this:

  1. Take a look at a few observations and form a guess as to what type of variable it is.
  2. Based on that guess, create an appropriate plot or table.
  3. If the plot or table looks as expected, calculate some summary measures. If not, go back to 1.

“Glimpse” to start: what’s in here anyway?

The first thing we’re going to do with any dataset is just to take a quick look. We can call the data itself, but that will just show the first few cases and the first few variables. Far better is the glimpse command, which shows us all variables and the first few observations for all of the variables. Here’s a link to the codebook for this dataset:

The six variables we’re going to think about are field goals, free throw percentage, seasons played, rookie season, college attended, and conference played in.

glimpse(nba)
## Rows: 530
## Columns: 37
## $ namePlayer         <chr> "LaMarcus Aldridge", "Quincy Acy", "Steven Adams", …
## $ idPlayer           <dbl> 200746, 203112, 203500, 203518, 1628389, 1628959, 1…
## $ slugSeason         <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-1…
## $ numberPlayerSeason <dbl> 12, 6, 5, 2, 1, 0, 0, 0, 0, 0, 8, 5, 4, 3, 1, 1, 1,…
## $ isRookie           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE…
## $ slugTeam           <chr> "SAS", "PHX", "OKC", "OKC", "MIA", "CHI", "UTA", "C…
## $ idTeam             <dbl> 1610612759, 1610612756, 1610612760, 1610612760, 161…
## $ gp                 <dbl> 81, 10, 80, 31, 82, 10, 38, 19, 34, 7, 81, 72, 43, …
## $ gs                 <dbl> 81, 0, 80, 2, 28, 1, 2, 3, 1, 0, 81, 72, 40, 4, 80,…
## $ fgm                <dbl> 684, 4, 481, 56, 280, 13, 67, 11, 38, 3, 257, 721, …
## $ fga                <dbl> 1319, 18, 809, 157, 486, 39, 178, 36, 110, 10, 593,…
## $ pctFG              <dbl> 0.519, 0.222, 0.595, 0.357, 0.576, 0.333, 0.376, 0.…
## $ fg3m               <dbl> 10, 2, 0, 41, 3, 3, 32, 6, 25, 0, 96, 52, 9, 24, 6,…
## $ fg3a               <dbl> 42, 15, 2, 127, 15, 12, 99, 23, 74, 4, 280, 203, 34…
## $ pctFG3             <dbl> 0.2380952, 0.1333333, 0.0000000, 0.3228346, 0.20000…
## $ pctFT              <dbl> 0.847, 0.700, 0.500, 0.923, 0.735, 0.667, 0.750, 1.…
## $ fg2m               <dbl> 674, 2, 481, 15, 277, 10, 35, 5, 13, 3, 161, 669, 1…
## $ fg2a               <dbl> 1277, 3, 807, 30, 471, 27, 79, 13, 36, 6, 313, 1044…
## $ pctFG2             <dbl> 0.5277995, 0.6666667, 0.5960347, 0.5000000, 0.58811…
## $ agePlayer          <dbl> 33, 28, 25, 25, 21, 21, 23, 22, 23, 26, 28, 24, 25,…
## $ minutes            <dbl> 2687, 123, 2669, 588, 1913, 120, 416, 194, 428, 22,…
## $ ftm                <dbl> 349, 7, 146, 12, 166, 8, 45, 4, 7, 1, 150, 500, 37,…
## $ fta                <dbl> 412, 10, 292, 13, 226, 12, 60, 4, 9, 2, 173, 686, 6…
## $ oreb               <dbl> 251, 3, 391, 5, 165, 11, 3, 3, 11, 1, 112, 159, 48,…
## $ dreb               <dbl> 493, 22, 369, 43, 432, 15, 20, 16, 49, 3, 498, 739,…
## $ treb               <dbl> 744, 25, 760, 48, 597, 26, 23, 19, 60, 4, 610, 898,…
## $ ast                <dbl> 194, 8, 124, 20, 184, 13, 25, 5, 65, 6, 104, 424, 1…
## $ stl                <dbl> 43, 1, 117, 17, 71, 1, 6, 1, 14, 2, 68, 92, 54, 22,…
## $ blk                <dbl> 107, 4, 76, 6, 65, 0, 6, 4, 5, 0, 33, 110, 37, 13, …
## $ tov                <dbl> 144, 4, 135, 14, 121, 8, 33, 6, 28, 2, 72, 268, 58,…
## $ pf                 <dbl> 179, 24, 204, 53, 203, 7, 47, 13, 45, 4, 143, 232, …
## $ pts                <dbl> 1727, 17, 1108, 165, 729, 37, 211, 32, 108, 7, 760,…
## $ urlNBAAPI          <chr> "https://stats.nba.com/stats/playercareerstats?Leag…
## $ n                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ org                <fct> Texas, NA, Other, FC Barcelona Basquet, Kentucky, N…
## $ country            <chr> NA, NA, NA, "Spain", NA, NA, NA, NA, NA, NA, NA, "S…
## $ idConference       <int> 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, …

Continuous

Let’s start by taking a look at field goals. It seems pretty likely that this is a continuous variable. Let’s take a look at the top 50 spots.

nba%>% ## Start with the dataset
  select(namePlayer,slugTeam,fgm)%>% ## and then select a few variables
  arrange(-fgm)%>% ## arrange in reverse order of field goals
  print(n=50) ## print out the top 50
## # A tibble: 530 × 3
##    namePlayer            slugTeam   fgm
##    <chr>                 <chr>    <dbl>
##  1 James Harden          HOU        843
##  2 Bradley Beal          WAS        764
##  3 Kemba Walker          CHA        731
##  4 Giannis Antetokounmpo MIL        721
##  5 Kevin Durant          GSW        721
##  6 Paul George           OKC        707
##  7 Nikola Vucevic        ORL        701
##  8 LaMarcus Aldridge     SAS        684
##  9 Damian Lillard        POR        681
## 10 Karl-Anthony Towns    MIN        681
## 11 Donovan Mitchell      UTA        661
## 12 D'Angelo Russell      BKN        659
## 13 Klay Thompson         GSW        655
## 14 Stephen Curry         GSW        632
## 15 DeMar DeRozan         SAS        631
## 16 Russell Westbrook     OKC        630
## 17 Buddy Hield           SAC        623
## 18 Blake Griffin         DET        619
## 19 Nikola Jokic          DEN        616
## 20 Tobias Harris         MIN        611
## 21 Kyrie Irving          BOS        604
## 22 Devin Booker          PHX        586
## 23 Joel Embiid           PHI        580
## 24 CJ McCollum           POR        571
## 25 Julius Randle         NOP        571
## 26 Andre Drummond        DET        561
## 27 Kawhi Leonard         TOR        560
## 28 LeBron James          LAL        558
## 29 Jrue Holiday          NOP        547
## 30 Montrezl Harrell      LAC        546
## 31 Ben Simmons           PHI        540
## 32 Anthony Davis         NOP        530
## 33 Zach LaVine           CHI        530
## 34 Jordan Clarkson       CLE        529
## 35 Trae Young            ATL        525
## 36 Bojan Bogdanovic      IND        522
## 37 Pascal Siakam         TOR        519
## 38 Collin Sexton         CLE        519
## 39 Jamal Murray          DEN        513
## 40 Deandre Ayton         PHX        509
## 41 Luka Doncic           DAL        506
## 42 Khris Middleton       MIL        506
## 43 De'Aaron Fox          SAC        505
## 44 Andrew Wiggins        MIN        498
## 45 Kyle Kuzma            LAL        496
## 46 Mike Conley           MEM        490
## 47 Lou Williams          LAC        484
## 48 Steven Adams          OKC        481
## 49 Rudy Gobert           UTA        476
## 50 Clint Capela          HOU        474
## # ℹ 480 more rows

So what I’m seeing here is that field goals aren’t “clumped” at certain levels. Let’s confirm that by looking at a kernel density plot.

nba%>%
  ggplot(aes(x=fgm))+
  geom_density()

We can also use a histogram to figure out much the same thing.

nba%>%
  ggplot(aes(x=fgm))+
  geom_histogram()

Now, technically field goals don’t meet the definition I set out above as being a continuous variable because they aren’t divisible below a certain amount. Usually in practice though we just ignore this– this variable is “as good as” continuous, given that it varies smoothly over the range and isn’t confined to a relatively small set of possible values.
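One informal way to judge whether a variable is “as good as” continuous is to count how many distinct values it takes relative to the number of cases. The sketch below does that on a made-up tibble (toy and its values are hypothetical, not the real nba data), using n_distinct() from dplyr.

```r
library(dplyr)

# Hypothetical toy data, not the real nba tibble
toy <- tibble(
  fgm     = c(843, 764, 731, 56, 13, 4, 280, 481, 67, 11),
  seasons = c(0, 0, 1, 1, 2, 0, 1, 2, 0, 1)
)

distinct_counts <- toy %>%
  summarize(
    n_players        = n(),                 # number of cases
    distinct_fgm     = n_distinct(fgm),     # nearly every case has its own value
    distinct_seasons = n_distinct(seasons)  # only a handful of possible values
  )

print(distinct_counts)
```

This is a rule of thumb rather than a test: a variable with about as many distinct values as cases behaves like a continuous one, while a variable confined to a few values is better treated as categorical.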

Quick Exercise: Do the same thing for field goal percentage and think about what kind of variable it is.

# INSERT CODE HERE

Measures for Continuous Variables

The mean is used most of the time for continuous variables, but it’s VERY sensitive to outliers. The median (50th percentile) is usually better, but it can be difficult to explain to general audiences.

nba%>%
  summarize(mean_fgm=mean(fgm))
## # A tibble: 1 × 1
##   mean_fgm
##      <dbl>
## 1     191.
nba%>%
  summarize(median_fgm=median(fgm))
## # A tibble: 1 × 1
##   median_fgm
##        <dbl>
## 1        157

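In this case I’d really prefer the median as a single measure of field goal production, since the mean (191) sits well above the median (157), which tells us a few very high scorers are pulling it up. Depending on the audience, I still might just go ahead and use the mean.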

Quick Exercise: What measure would you prefer for field goal percentage? Calculate that measure.

# INSERT CODE HERE

Categorical: ordered

Let’s take a look at player seasons.

nba%>%
  select(namePlayer,numberPlayerSeason)%>%
  arrange(-numberPlayerSeason)%>%
  print(n=50)
## # A tibble: 530 × 2
##    namePlayer        numberPlayerSeason
##    <chr>                          <dbl>
##  1 Vince Carter                      20
##  2 Dirk Nowitzki                     20
##  3 Jamal Crawford                    18
##  4 Tony Parker                       17
##  5 Tyson Chandler                    17
##  6 Pau Gasol                         17
##  7 Nene                              16
##  8 Carmelo Anthony                   15
##  9 Udonis Haslem                     15
## 10 LeBron James                      15
## 11 Zaza Pachulia                     15
## 12 Dwyane Wade                       15
## 13 Kyle Korver                       15
## 14 Luol Deng                         14
## 15 Devin Harris                      14
## 16 Dwight Howard                     14
## 17 Andre Iguodala                    14
## 18 JR Smith                          14
## 19 Trevor Ariza                      14
## 20 Andrew Bogut                      13
## 21 Jose Calderon                     13
## 22 Raymond Felton                    13
## 23 Amir Johnson                      13
## 24 Shaun Livingston                  13
## 25 Chris Paul                        13
## 26 Marvin Williams                   13
## 27 Lou Williams                      13
## 28 CJ Miles                          13
## 29 LaMarcus Aldridge                 12
## 30 J.J. Barea                        12
## 31 Channing Frye                     12
## 32 Rudy Gay                          12
## 33 Kyle Lowry                        12
## 34 Paul Millsap                      12
## 35 JJ Redick                         12
## 36 Rajon Rondo                       12
## 37 Thabo Sefolosha                   12
## 38 Marco Belinelli                   11
## 39 Mike Conley                       11
## 40 Kevin Durant                      11
## 41 Jared Dudley                      11
## 42 Marcin Gortat                     11
## 43 Gerald Green                      11
## 44 Al Horford                        11
## 45 Joakim Noah                       11
## 46 Thaddeus Young                    11
## 47 Nick Young                        11
## 48 Corey Brewer                      11
## 49 D.J. Augustin                     10
## 50 Jerryd Bayless                    10
## # ℹ 480 more rows

Looks like it might be continuous? Let’s plot it:

nba%>%
  ggplot(aes(x=numberPlayerSeason))+
  geom_histogram(binwidth = 1)

Nope. See how it falls into a small set of possible categories? This is an ordered categorical variable. That means we should calculate the proportions in each category.

nba%>%
  group_by(numberPlayerSeason)%>%
  count(name="total_in_group")%>%
  ungroup()%>%
  mutate(proportion=total_in_group/sum(total_in_group))
## # A tibble: 20 × 3
##    numberPlayerSeason total_in_group proportion
##                 <dbl>          <int>      <dbl>
##  1                  0            105    0.198  
##  2                  1             89    0.168  
##  3                  2             56    0.106  
##  4                  3             43    0.0811 
##  5                  4             37    0.0698 
##  6                  5             33    0.0623 
##  7                  6             31    0.0585 
##  8                  7             25    0.0472 
##  9                  8             19    0.0358 
## 10                  9             20    0.0377 
## 11                 10             24    0.0453 
## 12                 11             11    0.0208 
## 13                 12              9    0.0170 
## 14                 13              9    0.0170 
## 15                 14              6    0.0113 
## 16                 15              6    0.0113 
## 17                 16              1    0.00189
## 18                 17              3    0.00566
## 19                 18              1    0.00189
## 20                 20              2    0.00377

What does this tell us?

Quick Exercise: Create a histogram for player age. What does that tell us about the NBA?

# INSERT CODE HERE

Categorical: ordered, binary

Let’s take a look at the variable for Rookie season.

nba%>%select(namePlayer,isRookie)
## # A tibble: 530 × 2
##    namePlayer             isRookie
##    <chr>                  <lgl>   
##  1 LaMarcus Aldridge      FALSE   
##  2 Quincy Acy             FALSE   
##  3 Steven Adams           FALSE   
##  4 Alex Abrines           FALSE   
##  5 Bam Adebayo            FALSE   
##  6 Rawle Alkins           TRUE    
##  7 Grayson Allen          TRUE    
##  8 Deng Adel              TRUE    
##  9 Jaylen Adams           TRUE    
## 10 DeVaughn Akoon-Purcell TRUE    
## # ℹ 520 more rows

Okay, so that’s set to a logical. In R, TRUE or FALSE are special values that indicate the result of a logical question. In this case it’s whether or not the player is a rookie.

Usually we want a binary variable to have at least one version that’s structured so that 1=TRUE and 0=FALSE. This makes data analysis much easier. Let’s do that with this variable.

This code uses ifelse to create a new variable called isRookie_bin that’s set to 1 if the isRookie variable is true, and 0 otherwise.

nba<-nba%>%
  mutate(isRookie_bin=ifelse(isRookie==TRUE,1,0))

Now that it’s coded 0,1 we can calculate the mean, which is the same thing as the proportion of the players that are rookies.

nba%>%summarize(mean=mean(isRookie_bin))
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.198
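As a side note, R treats TRUE as 1 and FALSE as 0 in arithmetic, so the mean of a logical gives the same proportion as the mean of the 0/1 version. A minimal sketch on a made-up logical vector (is_rookie here is a hypothetical stand-in, not the column from the data):

```r
# Hypothetical toy data: five players, three of them rookies
is_rookie <- c(FALSE, FALSE, TRUE, TRUE, TRUE)

# Explicit 0/1 recode, as in the ifelse() approach
is_rookie_bin <- ifelse(is_rookie, 1, 0)

mean_from_bin     <- mean(is_rookie_bin)  # mean of the 0/1 version
mean_from_logical <- mean(is_rookie)      # mean of the logical itself

print(c(from_bin = mean_from_bin, from_logical = mean_from_logical))
```

Keeping an explicit 0/1 column is still handy, since it makes the coding visible to anyone reading the data later.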

Categorical: unordered

Let’s take a look at which college a player attended, which is a good example of an unordered categorical variable. The org variable tells us which organization the player was in before playing in the NBA.

nba%>%
  select(org)%>%
  glimpse()
## Rows: 530
## Columns: 1
## $ org <fct> Texas, NA, Other, FC Barcelona Basquet, Kentucky, NA, Duke, NA, NA…

These look like team or college names, so this would be a categorical variable. Let’s take a look at the counts of players from different organizations:

nba%>%
  group_by(org)%>%
  count()%>%
  arrange(-n)%>%
  print(n=50)
## # A tibble: 68 × 2
## # Groups:   org [68]
##    org                        n
##    <fct>                  <int>
##  1 <NA>                     157
##  2 Other                     85
##  3 Kentucky                  25
##  4 Duke                      17
##  5 California-Los Angeles    15
##  6 Kansas                    11
##  7 Arizona                   10
##  8 Texas                     10
##  9 North Carolina             9
## 10 Michigan                   8
## 11 Villanova                  7
## 12 Indiana                    6
## 13 Southern California        6
## 14 Syracuse                   6
## 15 California                 5
## 16 Louisville                 5
## 17 Ohio State                 5
## 18 Wake Forest                5
## 19 Colorado                   4
## 20 Connecticut                4
## 21 Creighton                  4
## 22 FC Barcelona Basquet       4
## 23 Florida                    4
## 24 Georgia Tech               4
## 25 Michigan State             4
## 26 Oregon                     4
## 27 Utah                       4
## 28 Washington                 4
## 29 Wisconsin                  4
## 30 Boston College             3
## 31 Florida State              3
## 32 Georgetown                 3
## 33 Gonzaga                    3
## 34 Iowa State                 3
## 35 Marquette                  3
## 36 Maryland                   3
## 37 Miami (FL)                 3
## 38 North Carolina State       3
## 39 Notre Dame                 3
## 40 Oklahoma                   3
## 41 Purdue                     3
## 42 Southern Methodist         3
## 43 Stanford                   3
## 44 Tennessee                  3
## 45 Virginia                   3
## 46 Anadolu Efes S.K.          2
## 47 Baylor                     2
## 48 Butler                     2
## 49 Cincinnati                 2
## 50 Kansas State               2
## # ℹ 18 more rows

Here we have a problem. If we’re interested just in colleges, we’re going to need to structure this a bit more. The code below filters out three categories that we don’t want: missing data, anything classified as “Other”, and sports teams from other countries. The last is incomplete; I probably missed some! If I were doing this for real, I would use a list of colleges and only include those names.

What I do below is negate the result of str_detect by placing the ! in front of it. This means I want all of the cases that don’t match the pattern supplied. The pattern makes heavy use of the OR operator |. I’m saying I don’t want to include players whose organization included the letters CB or KK and so on (these are common prefixes for sports organizations in other countries; I definitely did not look that up on Wikipedia. Ok, I did).
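To see the negation at work on something small, here is a sketch on a made-up vector of organization names (the vector orgs and the shortened pattern are illustrative, not taken from the data). It also shows one thing worth knowing about patterns such as "B.C.": in a regular expression an unescaped dot matches any single character, so writing it as \\. is one way to match a literal period.

```r
library(stringr)

# Hypothetical toy data: a mix of colleges and clubs
orgs <- c("Kentucky", "FC Barcelona Basquet", "Anadolu Efes S.K.",
          "KK Cibona", "Duke")

# TRUE wherever the name contains any of the alternatives separated by |
is_club <- str_detect(orgs, "CB|KK|FC|S\\.K\\.")

# ! flips TRUE and FALSE, so this keeps the names that do NOT match
colleges <- orgs[!is_club]
print(colleges)

# An unescaped dot matches any character; an escaped dot matches only "."
loose_match  <- str_detect("BaCk", "B.C.")      # TRUE: "a" and "k" fill the dots
strict_match <- str_detect("BaCk", "B\\.C\\.")  # FALSE: no literal periods
print(c(loose = loose_match, strict = strict_match))
```

With the loose version, a college whose name happened to contain a B, any character, a C, and any character would be dropped by mistake, which is another reason a proper list of colleges would be the safer route.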

nba%>%
  filter(!is.na(org))%>%
  filter(!org=="Other")%>%
  filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
  group_by(org)%>%
  count()%>%
  arrange(-n)%>%
  print(n=50)
## # A tibble: 57 × 2
## # Groups:   org [57]
##    org                        n
##    <fct>                  <int>
##  1 Kentucky                  25
##  2 Duke                      17
##  3 California-Los Angeles    15
##  4 Kansas                    11
##  5 Arizona                   10
##  6 Texas                     10
##  7 North Carolina             9
##  8 Michigan                   8
##  9 Villanova                  7
## 10 Indiana                    6
## 11 Southern California        6
## 12 Syracuse                   6
## 13 California                 5
## 14 Louisville                 5
## 15 Ohio State                 5
## 16 Wake Forest                5
## 17 Colorado                   4
## 18 Connecticut                4
## 19 Creighton                  4
## 20 Florida                    4
## 21 Georgia Tech               4
## 22 Michigan State             4
## 23 Oregon                     4
## 24 Utah                       4
## 25 Washington                 4
## 26 Wisconsin                  4
## 27 Boston College             3
## 28 Florida State              3
## 29 Georgetown                 3
## 30 Gonzaga                    3
## 31 Iowa State                 3
## 32 Marquette                  3
## 33 Maryland                   3
## 34 Miami (FL)                 3
## 35 North Carolina State       3
## 36 Notre Dame                 3
## 37 Oklahoma                   3
## 38 Purdue                     3
## 39 Southern Methodist         3
## 40 Stanford                   3
## 41 Tennessee                  3
## 42 Virginia                   3
## 43 Baylor                     2
## 44 Butler                     2
## 45 Cincinnati                 2
## 46 Kansas State               2
## 47 Louisiana State            2
## 48 Memphis                    2
## 49 Missouri                   2
## 50 Murray State               2
## # ℹ 7 more rows

That looks better. Which are the most common colleges and universities that send players to the NBA?

Quick Exercise: Arrange the number of players by team in descending order.

# INSERT CODE HERE

Categorical: unordered, binary

There are two conferences in the NBA, eastern and western. Let’s take a look at the variable that indicates which conference the player played in that season.

nba%>%select(idConference)%>%
  glimpse()
## Rows: 530
## Columns: 1
## $ idConference <int> 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, …

It looks like conference is stored as a number, either a “1” or a “2”. Because it’s best to have binary variables structured as “has the characteristic” or “doesn’t have the characteristic”, we’re going to create a variable for western conference that’s set to 1 if the player was playing in the western conference and 0 if the player was not (this is the same as playing in the eastern conference).

nba<-nba%>%
  mutate(west_conference=ifelse(idConference==2,1,0)) ## in the glimpse output, western teams (SAS, PHX, OKC, UTA) are coded 2

Once we’ve done that, we can see what proportion of players played in the western conference.

nba%>%
  summarize(mean(west_conference))
## # A tibble: 1 × 1
##   `mean(west_conference)`
##                     <dbl>
## 1                   0.492

Makes sense!

Quick Exercise: Create a variable for whether or not the player is from the USA. Calculate the proportion of players from the USA in the NBA. The coding on country is … decidedly US-centric, so you’ll need to think about this one a bit.

# INSERT CODE HERE

Analysis

Ok, now that we know how this works, we can do some summary analysis. First of all, what does the total number of field goals made look like by college?

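We know that field goals are continuous (sort of), so we could summarize them with a mean or a total. Here we want total production, so the code below adds up field goals with sum() (the result column is named mean_fg, but it holds totals). We know that college is a categorical variable, so we’ll use that to group the data. Summarizing one variable within the groups of another is the same idea behind a conditional mean, which we’ll use a lot.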
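The difference between a grouped total and a conditional mean is easy to see on a tiny example. The sketch below uses a made-up tibble of five players (toy and its numbers are hypothetical) and computes both within each college.

```r
library(dplyr)

# Hypothetical toy data: five players from two colleges
toy <- tibble(
  org = c("Kentucky", "Kentucky", "Duke", "Duke", "Duke"),
  fgm = c(500, 100, 300, 200, 130)
)

by_org <- toy %>%
  group_by(org) %>%
  summarize(
    total_fg  = sum(fgm),   # total production: grows with the number of players
    mean_fg   = mean(fgm),  # conditional mean: production per player
    n_players = n()
  ) %>%
  arrange(desc(total_fg))

print(by_org)
```

In this toy example Duke has the larger total (630 against 600) only because it has more players; Kentucky has the higher mean (300 against 210). Which one to report depends on whether the question is about a college’s overall footprint or about its typical player.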

Top 50 Colleges by Total FG

nba%>%
  filter(!is.na(org))%>%
  filter(!org=="Other")%>%
  filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
  group_by(org)%>%
  summarize(mean_fg=sum(fgm))%>%
  arrange(-mean_fg)%>%
  print(n=50)
## # A tibble: 57 × 2
##    org                    mean_fg
##    <fct>                    <dbl>
##  1 Kentucky                  6594
##  2 Duke                      4623
##  3 Texas                     3437
##  4 California-Los Angeles    3382
##  5 Kansas                    2765
##  6 Arizona                   2101
##  7 Oklahoma                  1767
##  8 Southern California       1758
##  9 Louisville                1679
## 10 North Carolina            1659
## 11 Indiana                   1522
## 12 Ohio State                1486
## 13 Michigan                  1392
## 14 Wake Forest               1364
## 15 Connecticut               1299
## 16 Villanova                 1222
## 17 Georgia Tech              1169
## 18 Tennessee                 1095
## 19 Stanford                   949
## 20 Utah                       943
## 21 Marquette                  873
## 22 Gonzaga                    863
## 23 Michigan State             820
## 24 Colorado                   818
## 25 Virginia                   816
## 26 Maryland                   811
## 27 Missouri                   756
## 28 California                 734
## 29 Florida State              733
## 30 Georgetown                 717
## 31 Memphis                    620
## 32 Florida                    618
## 33 North Carolina State       598
## 34 Boston College             586
## 35 Louisiana State            583
## 36 Syracuse                   567
## 37 Iowa State                 523
## 38 Butler                     459
## 39 Wisconsin                  456
## 40 Creighton                  432
## 41 Oregon                     352
## 42 Texas A&M                  322
## 43 Baylor                     312
## 44 Providence                 291
## 45 Purdue                     275
## 46 Notre Dame                 263
## 47 Ulkerspor                  252
## 48 Southern Methodist         246
## 49 Oklahoma State             242
## 50 West Virginia              236
## # ℹ 7 more rows

Next, what about free throw percentage?

Top 50 Colleges by Average Free Throw Percent

nba%>%
  filter(!is.na(org))%>%
  filter(!org=="Other")%>%
  filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
  group_by(org)%>%
  summarize(mean_ftp=mean(pctFT))%>%
  arrange(-mean_ftp)%>%
  print(n=50)
## # A tibble: 57 × 2
##    org                  mean_ftp
##    <fct>                   <dbl>
##  1 Tennessee               0.842
##  2 Virginia                0.833
##  3 Oklahoma                0.823
##  4 North Carolina State    0.817
##  5 West Virginia           0.804
##  6 Ulkerspor               0.803
##  7 Missouri                0.802
##  8 Wake Forest             0.802
##  9 Florida State           0.801
## 10 Murray State            0.798
## 11 Iowa State              0.795
## 12 Notre Dame              0.792
## 13 Memphis                 0.788
## 14 Florida                 0.784
## 15 Michigan                0.783
## 16 Stanford                0.779
## 17 Georgetown              0.775
## 18 Marquette               0.774
## 19 Utah                    0.770
## 20 Kansas State            0.767
## 21 Butler                  0.762
## 22 Gonzaga                 0.761
## 23 North Carolina          0.756
## 24 Villanova               0.755
## 25 Texas                   0.752
## 26 Connecticut             0.748
## 27 Providence              0.747
## 28 Boston College          0.742
## 29 Michigan State          0.730
## 30 Kansas                  0.729
## 31 Indiana                 0.729
## 32 Duke                    0.728
## 33 Baylor                  0.726
## 34 Arizona                 0.721
## 35 Pallacanestro Biella    0.718
## 36 Wisconsin               0.712
## 37 Kentucky                0.712
## 38 Georgia Tech            0.712
## 39 Louisiana State         0.709
## 40 Creighton               0.698
## 41 Maryland                0.695
## 42 Vanderbilt              0.688
## 43 Washington              0.680
## 44 Louisville              0.679
## 45 Ohio State              0.679
## 46 California              0.675
## 47 Southern Methodist      0.673
## 48 Oregon                  0.662
## 49 Texas A&M               0.652
## 50 Southern California     0.648
## # ℹ 7 more rows

Quick Exercise Calculate field goals made by player season.

# INSERT CODE HERE

Quick Exercise Calculate free throw percent made by player season.

# INSERT CODE HERE

Part II: Agenda (Data: 2020 Michigan Exit Polls)

Our tools depend on the type of variables we are trying to graph.

Returning to the “gender” gap

Conditional variation involves examining how the values of two or more variables are related to one another. Earlier we made these comparisons by creating different tibbles and then comparing across tibbles, but we can also make comparisons without creating multiple tibbles. So load in the Michigan 2020 Exit Poll Data.

library(tidyverse)
library(scales)
mi_ep <- read_rds("https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/MI2020_ExitPoll_small.rds")
MI_final_small <- mi_ep %>%
  filter(preschoice=="Donald Trump, the Republican" | preschoice=="Joe Biden, the Democrat") %>%
  mutate(BidenVoter=ifelse(preschoice=="Joe Biden, the Democrat",1,0),
         TrumpVoter=ifelse(BidenVoter==1,0,1),
         AGE10=ifelse(AGE10==99,NA,AGE10))

We learned that when we count using multiple variables, R counts observations within each combination of values. Can we use this to analyze how vote choice varies by group? Let’s see!

MI_final_small %>%
  filter(AGE10==1) %>%
  count(preschoice,SEX) %>%
  mutate(PctSupport = n/sum(n),
         PctSupport = round(PctSupport, digits=2))
## # A tibble: 4 × 4
##   preschoice                     SEX     n PctSupport
##   <chr>                        <dbl> <int>      <dbl>
## 1 Donald Trump, the Republican     1     7       0.22
## 2 Donald Trump, the Republican     2     1       0.03
## 3 Joe Biden, the Democrat          1     9       0.28
## 4 Joe Biden, the Democrat          2    15       0.47

Here we have broken everything out by both preschoice and SEX but the PctSupport is not quite what we want because it is the fraction of responses (out of 1) that are in each row rather than the proportion of support for each candidate by sex.

To correct this and to perform the functions within a value we need to use the group_by function.

We can use the group_by command to organize our data a bit better. What group_by does is to run all subsequent code separately according to the defined group.

So instead of running a count or summarize separately for both Males and Females as we did above, we can group_by the variable SEX (or SEX.chr or FEMALE – it makes no difference as they are all equivalent) and then perform the subsequent commands. So here we are going to filter to select those who are 24 and below and then we are going to count the number of Biden and Trump supporters within each value of SEX.

MI_final_small %>%
  filter(AGE10==1) %>%
  group_by(SEX) %>%
  count(preschoice)
## # A tibble: 4 × 3
## # Groups:   SEX [2]
##     SEX preschoice                       n
##   <dbl> <chr>                        <int>
## 1     1 Donald Trump, the Republican     7
## 2     1 Joe Biden, the Democrat          9
## 3     2 Donald Trump, the Republican     1
## 4     2 Joe Biden, the Democrat         15

Note that any functions of the data are also now organized by that grouping, so if we were to manually compute the proportions using the mutation approach discussed above we would get:

MI_final_small %>%
  filter(AGE10==1) %>%
  group_by(SEX) %>%
  count(preschoice) %>%
  mutate(PctSupport = n/sum(n),
         PctSupport = round(PctSupport, digits=2))
## # A tibble: 4 × 4
## # Groups:   SEX [2]
##     SEX preschoice                       n PctSupport
##   <dbl> <chr>                        <int>      <dbl>
## 1     1 Donald Trump, the Republican     7       0.44
## 2     1 Joe Biden, the Democrat          9       0.56
## 3     2 Donald Trump, the Republican     1       0.06
## 4     2 Joe Biden, the Democrat         15       0.94

So you can see that PctSupport sums to 2.0 because it sums to 1.0 within each value of the grouping variable SEX.

If we wanted the fraction of voters who are in each unique category, so that the percentages of all the categories sum to 1.0, we would want to ungroup before doing the mutation that calculates the percentage. So here the count after group_by() is done separately for each value of the grouping variable (here SEX), and then ungroup() undoes the grouping so that the mutate is applied to the entire dataset.

MI_final_small %>%
  filter(AGE10==1) %>%
  group_by(SEX) %>%
  count(preschoice) %>%
  ungroup() %>%
  mutate(PctSupport = n/sum(n),
         PctSupport = round(PctSupport, digits=2))
## # A tibble: 4 × 4
##     SEX preschoice                       n PctSupport
##   <dbl> <chr>                        <int>      <dbl>
## 1     1 Donald Trump, the Republican     7       0.22
## 2     1 Joe Biden, the Democrat          9       0.28
## 3     2 Donald Trump, the Republican     1       0.03
## 4     2 Joe Biden, the Democrat         15       0.47
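As a quick check on the grouped and ungrouped calculations, the sketch below rebuilds the four counts printed above (7, 9, 1, 15) in a small tibble and computes both versions of the share. The object name counts_printed and the column names ShareWithinSex and ShareOverall are made up for this illustration; only the counts come from the output above.

```r
library(dplyr)

# Counts copied from the printed output above (respondents with AGE10 == 1).
# The object and new column names are hypothetical, chosen for this sketch.
counts_printed <- tibble(
  SEX = c(1, 1, 2, 2),
  preschoice = rep(c("Donald Trump, the Republican", "Joe Biden, the Democrat"), 2),
  n = c(7, 9, 1, 15)
)

shares <- counts_printed %>%
  group_by(SEX) %>%
  mutate(ShareWithinSex = n / sum(n)) %>%   # denominator: respondents of that sex
  ungroup() %>%
  mutate(ShareOverall = n / sum(n))         # denominator: all respondents

shares

# Within-group shares sum to 1 for each sex; overall shares sum to 1 across all rows.
shares %>%
  group_by(SEX) %>%
  summarize(TotalWithin = sum(ShareWithinSex),
            TotalOverall = sum(ShareOverall))
```

The only difference between the two columns is the denominator, which is determined entirely by whether the data are grouped at the moment mutate runs.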

If we are just interested in the proportion and we do not care about the number of respondents in each value (although here it seems relevant!) we could also group_by and then summarize as follows:

MI_final_small %>%
  filter(AGE10==1) %>%
  group_by(SEX) %>%
  summarize(PctBiden = mean(BidenVoter),
          PctTrump = mean(TrumpVoter)) %>%
  mutate(PctBiden = round(PctBiden, digits =2),
         PctTrump = round(PctTrump, digits =2))
## # A tibble: 2 × 3
##     SEX PctBiden PctTrump
##   <dbl>    <dbl>    <dbl>
## 1     1     0.56     0.44
## 2     2     0.94     0.06

Because we have already filtered to focus only on Biden and Trump voters, we don’t actually need both since PctBiden = 1 - PctTrump and PctTrump = 1 - PctBiden.

Note that we can have multiple groups. So if we want to group by age and by sex we can do the following…

MI_final_small %>%
  group_by(SEX, AGE10) %>%
  summarize(PctBiden = mean(BidenVoter)) %>%
  mutate(PctBiden = round(PctBiden, digits =2))
## # A tibble: 22 × 3
## # Groups:   SEX [2]
##      SEX AGE10 PctBiden
##    <dbl> <dbl>    <dbl>
##  1     1     1     0.56
##  2     1     2     0.58
##  3     1     3     0.58
##  4     1     4     0.76
##  5     1     5     0.58
##  6     1     6     0.42
##  7     1     7     0.46
##  8     1     8     0.56
##  9     1     9     0.61
## 10     1    10     0.57
## # ℹ 12 more rows

We can also save it for later analysis and then filter or select the results. For example:

SexAge <- MI_final_small %>%
  group_by(SEX, AGE10) %>%
  summarize(PctBiden = mean(BidenVoter)) %>%
  mutate(PctBiden = round(PctBiden, digits =2)) %>%
  drop_na()

So if we want to look at the Biden support by age among females (i.e., SEX==2) we can look at:

SexAge %>%
  filter(SEX == 2)
## # A tibble: 10 × 3
## # Groups:   SEX [1]
##      SEX AGE10 PctBiden
##    <dbl> <dbl>    <dbl>
##  1     2     1     0.94
##  2     2     2     0.93
##  3     2     3     0.69
##  4     2     4     0.71
##  5     2     5     0.52
##  6     2     6     0.6 
##  7     2     7     0.63
##  8     2     8     0.74
##  9     2     9     0.69
## 10     2    10     0.61

And for Men…
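No code for men is shown here. A minimal sketch, assuming the same filter with SEX == 1: to keep it runnable on its own, the tibble below (hypothetically named SexAge_printed) is rebuilt from the rounded values printed above rather than from the survey file. With the real SexAge object, the single line SexAge %>% filter(SEX == 1) would be the equivalent.

```r
library(dplyr)

# Rounded PctBiden values copied from the two printed tables above
# (SEX == 1 rows first, then SEX == 2). The object name is hypothetical.
SexAge_printed <- tibble(
  SEX = rep(c(1, 2), each = 10),
  AGE10 = rep(1:10, times = 2),
  PctBiden = c(0.56, 0.58, 0.58, 0.76, 0.58, 0.42, 0.46, 0.56, 0.61, 0.57,
               0.94, 0.93, 0.69, 0.71, 0.52, 0.60, 0.63, 0.74, 0.69, 0.61)
)

# Biden support by age group among men (SEX == 1).
men <- SexAge_printed %>%
  filter(SEX == 1)
men
```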

Discrete Variable By Discrete Variable (Barplot)

If we are working with discrete, categorical, or ordinal data (i.e., variables that take on a finite and small number of unique values), then we are interested in how to compare across bar graphs.

Before we used geom_bar to plot the number of observations associated with each value of a variable. But we often want to know how the number of observations may vary according to a second variable. For example, we care not only about why voters reported that they supported Biden or Trump in 2020 but we are also interested in knowing whether Biden and Trump voters were voting for similar or different reasons. Did voters differ in terms of why they were voting for a candidate in addition to who they were voting for? If so, this may suggest something about what each set of voters were looking for in a candidate.

Let’s first plot the barplot and then plot the barplot by presidential vote choice for the Michigan Exit Poll we were just analyzing.

We are interested in the distribution of responses to the variable Quality and we only care about voters who voted for either Biden or Trump (preschoice), so let’s select those variables and filter using preschoice to select those respondents. We have an additional complication: the question was only asked of half of the respondents, and some who were asked refused to answer. To remove the respondents with missing values we want to drop_na. Note that this will drop every observation with a missing value in any remaining variable. This is acceptable because we have used select to focus on the variables we are analyzing, but if we did not use select it would have dropped an observation with missing data in any variable (we could get around this using drop_na(Quality) if we wanted). A final complication is that some respondents did not answer the question they were asked, so we use filter to remove the “Don’t know/refused” responses.

Now we include labels – note how we are suppressing the x-label because the value labels are self-explanatory in this instance and add the geom_bar as before.

mi_ep %>% 
    select(Quality,preschoice) %>%
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    drop_na() %>%
    filter(Quality != "[DON'T READ] Don’t know/refused") %>%
    ggplot(aes(x= Quality)) +     
    labs(y = "Number of Voters",
         x = "",
         title = "Michigan 2020 Exit Poll: Reasons for voting for a candidate") +
    geom_bar(color="black") 

Note that if we add coord_flip we will flip the axes of the graph. (We could also have done this by using aes(y = Quality) instead, but then we would also have to change the associated labels.)

mi_ep %>% 
    select(Quality,preschoice) %>%
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    drop_na() %>%
    filter(Quality != "[DON'T READ] Don’t know/refused") %>%
    ggplot(aes(x= Quality)) +     
    labs(y = "Number of Voters",
         x = "",
         title = "Michigan 2020 Exit Poll: Reasons for voting for a candidate") +
    geom_bar(color="black") + 
  coord_flip()

So enough review, let’s add another dimension to the data. To show how the self-reported reasons for voting for a presidential candidate varied by vote choice we are going to use the fill of the graph to create different color bars depending on the value of the character or factor variable that is used to fill.

So we are going to include as a ggplot aesthetic a character or factor variable as a fill (here fill=preschoice) and then we are going to also include fill in the labs function to make sure that we label the meaning of the values being plotted. The other change we have made is in geom_bar where we used position="dodge" to make sure that the bars are plotted next to one another rather than on top of one another.

mi_ep %>% 
    select(Quality,preschoice) %>%
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    drop_na() %>%
    filter(Quality != "[DON'T READ] Don’t know/refused") %>%
    ggplot(aes(x= Quality, fill = preschoice)) +     
    labs(y = "Number of Voters",
         x = "",
         title = "Michigan 2020 Exit Poll: Reasons for voting for a candidate",
         fill = "Self-Reported Vote") +
    geom_bar(color="black", position="dodge") +
    coord_flip()

For fun, see what happens when you do not use position="dodge". Also see what happens if you do not flip the coordinates using coord_flip.

It is important to note that the fill variable has to be a character or a factor. If we want to graph self-reported vote by sex, for example, we need to redefine the variable for the purposes of ggplot as follows. Note that because we are not mutating it and we are only defining it to be a factor within the ggplot object, this redefinition will not stick. Note also the problem caused by uninformative values in SEX – can you change it?

mi_ep %>% 
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    ggplot(aes(x= preschoice, fill = factor(SEX))) +     
    labs(y = "Number of Respondents",
         x = "",
         title = "Vote by Respondent Sex",
         fill = "Sex") +
    geom_bar(color="black", position="dodge") +
    coord_flip()

Quick Exercise The barplot we just produced does not satisfy our principles of visualization because the fill being used is uninterpretable to those unfamiliar with the dataset. Redo the code to use a fill variable that produces an informative label. Hint: don’t overthink.

# INSERT CODE HERE

Part III: From Gender Gap to Mode of Polling (Data: 2020 Presidential Polling)

Suppose we were concerned with whether some polls might give different answers because of variation in who the poll is able to reach using that method. People who take polls via landline phones (do you even know what that is?) might differ from those who take surveys online. Or people contacted using randomly generated phone numbers (RDD) may differ from those contacted from a voter registration list that has had telephone numbers merged onto it.

Polls were done using lots of different methods in 2020.

Loading the data

require(tidyverse)
Pres2020.PV <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Pres2020_PV.Rds")
Pres2020.PV <- Pres2020.PV %>%
                mutate(Trump = Trump/100,
                      Biden = Biden/100,
                      margin = Biden - Trump)
Pres2020.PV %>%
  count(Mode)
## # A tibble: 9 × 2
##   Mode                 n
##   <chr>            <int>
## 1 IVR                  1
## 2 IVR/Online          47
## 3 Live phone - RBS    13
## 4 Live phone - RDD    51
## 5 Online             366
## 6 Online/Text          1
## 7 Phone - unknown      1
## 8 Phone/Online        19
## 9 <NA>                29

This raises the question: how do we visualize variation in one variable by another variable? More specifically, how can we visualize how the margin we get using one type of survey compares to the margin from another type of poll? (We cannot use a scatterplot because the data is from different observations (here polls).)

We could do this using earlier methods by selecting polls with a specific interview method (“mode”) and then plotting the margin (or Trump or Biden), but that will produce a bunch of separate plots that may be hard to directly compare. (In addition to having more things to look at we would want to make sure that the scale of the x-axis and y-axis are similar.)

We can plot another “layer” of data in ggplot using the fill parameter. Previously we used it to make the graphs look nice by choosing a particular color. But if we set fill to be a variable in our tibble then ggplot will plot the data separately for each unique value in the named variable. So if we want to plot the histogram of margin for two types of polls we can use the fill argument in ggplot to tell R to produce different fills depending on the value of that variable.

Pres2020.PV %>% 
  filter(Mode == "IVR/Online" | Mode == "Live phone - RDD") %>%
    ggplot(aes(x= margin, fill = Mode)) +     
  labs(y = "Number of Polls",
         x = "Biden- Trump Margin",
         title = "Biden-Trump Margin for Two Types of Polls",
        fill = "Mode of Interview") +
    geom_histogram(bins=10, color="black", position="dodge") + 
    scale_x_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1))

Quick Exercise Try running the code without the filter. What do you observe? How useful is this? Why or why not?

# INSERT CODE

While informative, it can be hard to compare the distribution of more than two categories using such methods. To compare the variation across more types of surveys we need to use a different visualization that summarizes the variation in the variable of interest a bit more. One common visualization is the boxplot, which reports the median, the 25th percentile (i.e., the value of the data if we sort the data from lowest to highest and take the value of the observation that is 25% of the way through), the 75th percentile, the range of values, and notable outliers.
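To make the pieces of a boxplot concrete, the sketch below computes them by hand for a handful of made-up margins. The values are hypothetical and chosen only so the arithmetic is easy to follow; the 1.5 × box-height rule for flagging outliers is the convention geom_boxplot uses by default.

```r
# Hypothetical poll margins, invented purely for illustration.
margins <- c(0.02, 0.05, 0.07, 0.08, 0.10, 0.12, 0.25)

# Box edges (25th and 75th percentiles) and the middle line (the median).
q <- quantile(margins, probs = c(0.25, 0.50, 0.75))
box_height <- IQR(margins)   # 75th percentile minus 25th percentile

# Points farther than 1.5 box heights from the box are drawn individually as outliers.
lower_fence <- q[["25%"]] - 1.5 * box_height
upper_fence <- q[["75%"]] + 1.5 * box_height
outliers <- margins[margins < lower_fence | margins > upper_fence]

q
outliers
```

Here the largest value falls above the upper fence, so a boxplot of these numbers would draw it as a separate point rather than extending the whisker to reach it.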

Let’s see what the boxplot of survey mode looks like after we first drop surveys that were conducted using modes that were hardly used (or missing).

Pres2020.PV %>% 
  filter(Mode != "IVR" & Mode != "Online/Text" & Mode != "Phone - unknown" & !is.na(Mode)) %>%
  ggplot(aes(x = Mode, y = margin)) + 
    labs(x = "Mode of Survey Interview",
         y = "Biden- Trump Margin",
         title = "2020 Popular Vote Margin by Type of Poll") +
    geom_boxplot(fill = "slateblue") +
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1))

We can also flip the graph if we think it makes more sense to display it in a different orientation using coord_flip. (We could, of course, also redefine the x and y variables in the ggplot object, but it is useful to have a command to do this to help you determine which orientation is most useful.)

Pres2020.PV %>% 
  filter(Mode != "IVR" & Mode != "Online/Text" & Mode != "Phone - unknown" & !is.na(Mode)) %>%
  ggplot(aes(x = Mode, y = margin)) + 
    labs(x = "Mode of Survey Interview",
         y = "Biden- Trump Margin",
         title = "2020 Popular Vote Margin by Type of Poll") +
    geom_boxplot(fill = "slateblue") +
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    coord_flip()

A downside of the boxplot is that it can be hard to tell how the data varies within each box. Is it equally spread out? How much data are contained in the lines (which extend to the most extreme observation within 1.5 times the height of the box from each edge)? To get a better handle on this we can use a “violin” plot that dispenses with a standard box and instead tries to plot the distribution of data within each category.

Pres2020.PV %>% 
  filter(Mode != "IVR" & Mode != "Online/Text" & Mode != "Phone - unknown" & !is.na(Mode)) %>%
  ggplot(aes(x=Mode, y=margin)) + 
    xlab("Mode") + 
    ylab("Biden- Trump Margin") +
    geom_violin(fill="slateblue")

It is also hard to know how much data is being plotted. If some modes have 1000 polls and others have only 5 that seems relevant.

Quick Exercise We have looked at the difference in margin. How about differences in the percent who report supporting Biden and Trump? What do you observe? Does this suggest that the different ways of contacting respondents may matter in terms of who responds? Is there something else that may explain the differences (i.e., what are we assuming when making this comparison)?

# INSERT CODE HERE

Quick Exercise Some claims have been made that polls that used multiple ways of contacting respondents were better than polls that used just one. Can you evaluate whether there were differences in so-called “mixed-mode” surveys compared to single-mode surveys? (This requires you to define a new variable based on Mode indicating whether survey is mixed-mode or not.)

# INSERT CODE HERE

Continuous Variable By Continuous Variable (Scatterplot)

When we have two continuous variables we use a scatterplot to visualize the relationship. A scatterplot is simply a graph of every point in (x,y) where x is the value associated with the x-variable and y is the value associated with the y-variable. For example, we may want to see how support for Trump and Biden within a poll varies. So each observation is a poll of the national popular vote and we are going to plot the percentage of respondents in each poll supporting Biden against the percentage who support Trump.

To include two variables we are going to change our aesthetic to define both an x variable and a y variable – here aes(x = Biden, y = Trump) and we are going to label and scale the axes appropriately.

Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_point(color="purple") + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1))

The results are intriguing! First the data seems like it falls along a grid. This is because of how poll results are reported in terms of percentage points and it highlights that even continuous variables may be reported in discrete values. This is consequential because it is hard to know how many polls are associated with each point on the graph. How many polls are at the point (Biden 50%, Trump 45%)? This matters for trying to determine what the relationship might be. Second, it is clear that there are some questions that need to be asked: why doesn’t Biden + Trump = 100%?
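Both questions can also be answered with a table rather than a plot. The sketch below uses a tiny, hypothetical set of polls (the tibble toy_polls and the column Neither are made up for illustration) to count how many polls fall on each grid point and to compute the share of respondents who chose neither candidate.

```r
library(dplyr)

# Hypothetical poll results (as proportions), invented for illustration only.
toy_polls <- tibble(
  Biden = c(0.50, 0.50, 0.50, 0.52, 0.48),
  Trump = c(0.45, 0.45, 0.42, 0.42, 0.44)
)

# How many polls sit on each (Biden, Trump) grid point?
per_point <- toy_polls %>%
  count(Biden, Trump, sort = TRUE)
per_point

# Share of respondents not choosing either candidate
# (for example undecided respondents or third-party supporters).
toy_polls <- toy_polls %>%
  mutate(Neither = 1 - Biden - Trump)
toy_polls
```

Running the same count(Biden, Trump) on Pres2020.PV would give the number of real polls hidden behind each plotted point.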

To try to display how many observations are located at each point we have two tools at our disposal. First, we can alter the “alpha transparency” by setting alpha = .3 in the geom_point call. With a low alpha value each individual point is mostly transparent, so a location becomes darker as more points are plotted at the same coordinate. Thus, a faint point indicates that only a single poll (observation) is located at a coordinate whereas a solid point indicates that there are many polls. When we apply this to the scatterplot you can immediately see that most of the polls are located in the neighborhood of Biden 50%, Trump 42%.

Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_point(color="purple",alpha = .3) + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1))

However, the grid-like nature of the plot is still somewhat hard to interpret as it can be hard to discern variations in color gradient. Another tool is to add a tiny bit of randomness to the x and y values associated with each point. Instead of values being constrained to vary by a full percentage point, for example, the jitter allows it to vary by less. To do so we replace geom_point with geom_jitter.

Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_jitter(color="purple",alpha = .5) + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) 

Note how much the visualization changes. Whereas before the eye was focused on, and arguably distracted by, the grid-like orientation imposed by the measurement, once we jitter the points we are immediately made aware of the relationship between the two variables. While we are indeed slightly changing our data by adding random noise, the payoff is that the visualization arguably better highlights the nature of the relationship. Insofar as the goal of visualization is communication, this trade-off seems worthwhile in this instance. But here again is where data science is sometimes art as much as science. The decision of which visualization to use depends on what you think most effectively communicates the nature of the relationship to the reader.

We can also look at the accuracy of a poll as a function of the sample size. This is also a relationship between two continuous variables, so we again use a scatterplot! Are polls with more respondents more accurate? There is one poll with nearly 80,000 respondents that we will filter out to be able to show a reasonable scale. Note that we are going to use labels = scales::comma when plotting the x-axis to report numbers with commas for readability.

Pres2020.PV %>%
  filter(SampleSize < 50000) %>%
  mutate(TrumpError = Trump - RepCertVote/100,
         BidenError = Biden - DemCertVote/100) %>%
  ggplot(aes(x = SampleSize, y = TrumpError)) + 
  labs(title="Trump Polling Error in 2020 National Popular Vote as a function of Sample Size",
       y = "Error: Trump Poll - Trump Certified Vote",
       x = "Sample Size in Poll") + 
  geom_jitter(color="purple",alpha = .5) +
  scale_y_continuous(breaks=seq(-.2,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,30000,by=5000),
                     labels= scales::comma) 

In sum, we have tested Trump’s theory that the MSM was biased against him. We found that polls that underpredicted Trump also underpredicted Biden. This is not what we would expect if the polls favored one candidate over another.

Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_jitter(color="purple",alpha = .5) + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) 

What is an alternative explanation for these patterns? Why would polls underpredict both Trump and Biden?

Perhaps they were fielded earlier in the year, when more people were interested in third party candidates, or hadn’t made up their mind. We’ll turn to testing this theory next time!

Part IV: Poll Underpredictions (Data: 2020 Presidential Polling)

Recall that we have tested Trump’s theory that the MSM was biased against him. We found that polls that underpredicted Trump also underpredicted Biden. This is not what we would expect if the polls favored one candidate over another.

Loading the data

require(tidyverse)
require(scales)
Pres2020.PV <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Pres2020_PV.Rds")
Pres2020.PV <- Pres2020.PV %>%
                mutate(Trump = Trump/100,
                      Biden = Biden/100,
                      margin = Biden - Trump)
Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_jitter(color="purple",alpha = .5) + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) 

What is an alternative explanation for these patterns? Why would polls underpredict both Trump and Biden?

Perhaps they were fielded earlier in the year, when more people were interested in third party candidates, or hadn’t made up their mind.

Visualizing More Dimensions – And Introducing Dates!

How did the support for Biden and Trump vary across the course of the 2020 Election?

This will give you some tools to do some amazing things!

Telling Time

Dates in R

election.day <- as.Date("11/3/2020", "%m/%d/%Y")   
election.day16 <- as.Date("11/8/2016", "%m/%d/%Y")   
election.day - election.day16
## Time difference of 1456 days
as.numeric(election.day - election.day16)
## [1] 1456
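The same pattern is what turns a poll's end date into a "days until the election" count. A small sketch, using a hypothetical poll end date (poll.end and days.to.ed are names invented for this example):

```r
# The second argument tells as.Date how the text is laid out:
# %m = month number, %d = day of month, %Y = four-digit year.
election.day <- as.Date("11/3/2020", "%m/%d/%Y")

# Hypothetical poll end date, invented for illustration.
poll.end <- as.Date("10/27/2020", "%m/%d/%Y")

# Subtracting dates gives a "difftime"; as.numeric() turns it into a plain number of days.
days.to.ed <- as.numeric(election.day - poll.end)
days.to.ed

# format() converts a Date back into text in whatever layout we ask for.
format(election.day, "%Y-%m-%d")
```

If the format string does not match the text (for example, reading "11/3/2020" with "%Y-%m-%d"), as.Date returns NA rather than raising an error, so it is worth checking the result.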

Initial Questions

So, for every day, how many polls were reported by the media?

Note that we could also look to see if the poll results depend on how the poll is being done (i.e., the Mode used to contact respondents) or even depending on who funded (Funded) or conducted (Conducted) the polls. We could also see if larger polls (i.e., polls with more respondents SampleSize) or polls that took longer to conduct (DaysInField) were more or less accurate.

Let’s Wrangle…

Pres2020.PV <- Pres2020.PV %>%
                mutate(EndDate = as.Date(EndDate, "%m/%d/%Y"), 
                      StartDate = as.Date(StartDate, "%m/%d/%Y"),
                      DaysToED = as.numeric(election.day - EndDate))
# Trump, Biden, and margin were already rescaled to proportions when the data were loaded above,
# so they are not divided by 100 a second time here.

What are we plotting?

ggplot(data = Pres2020.PV, aes(x = DaysToED)) +
  labs(title = "Number of 2020 National Polls Over Time",
       x = "Number of Days until Election Day",
       y = "Number of Polls") + 
  geom_bar(fill="purple", color= "black") +
  scale_x_continuous(breaks=seq(0,230,by=10)) +
  scale_y_continuous(labels = label_number(accuracy = 1))

So this is a bit weird because it arranges the axis from smallest to largest even though that is in reverse chronological order. Since November comes after January it may make sense to flip the scale so that the graph plots polls that are closer to Election Day as the reader moves along the scale to the right.

ggplot(data = Pres2020.PV, aes(x = DaysToED)) +
  labs(title = "Number of 2020 National Polls Over Time") + 
  labs(x = "Number of Days until Election Day") + 
  labs(y = "Number of Polls") + 
  geom_bar(fill="purple", color= "black") +
  scale_x_reverse(breaks=seq(0,230,by=10)) +
  scale_y_continuous(labels = label_number(accuracy = 1))

But is the bargraph the right plot to use? Should we plot every single day given that we know some days might contain fewer polls (e.g., weekends)? What if we were to use a histogram instead to plot the number of polls that occur in a set interval of time (to be defined by the number of bins chosen)?

ggplot(data = Pres2020.PV, aes(x = DaysToED)) +
  labs(title = "Number of 2020 National Polls Over Time") + 
  labs(x = "Number of Days until Election Day") + 
  labs(y = "Number of Polls") + 
  geom_histogram(color="PURPLE",bins = 30) +
  scale_x_reverse()

Which do you prefer? Why or why not? Data Science is sometimes as much art as it is science - especially when it comes to data visualization!

Bivariate/Multivariate relationships

A very frequently used plot is the scatterplot that shows the relationship between two variables. To do so we are going to define an aesthetic when calling ggplot that defines both an x-variable (here EndDate) and a y-variable (here margin).

First, a brief aside: we can change the size of the figure being printed in our Rmarkdown by including some options when defining the R chunk. Much like we could suppress messages (message=FALSE), tell R not to actually evaluate the chunk (eval=FALSE), or run the code but not print it (echo=FALSE), we can add some arguments to this line. For example, the code below defines the figure to be 2 inches tall, 2 inches wide and to be aligned in the center (as opposed to left-justified). As you can see, depending on the choices you make you can destroy the readability of the graphics as ggplot will attempt to rescale the figure accordingly. The dimensions are in inches and they refer to the plotting area, which includes labels and margins, so it is not just the area where the data itself appears.

margin_over_time_plot <- Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = margin)) + 
  labs(title="Margin in 2020 National Popular Vote Polls Over Time",
       y = "Margin: Biden - Trump",
       x = "Poll Ending Date") + 
  geom_point(color="purple")
margin_over_time_plot

To “fix” this we can call the ggplot object without defining the graphical parameters.

margin_over_time_plot
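The chunk options discussed in this aside do not appear in the rendered output. As a rough sketch, assuming the usual knitr option names fig.height, fig.width, and fig.align, a chunk header such as {r, fig.height=2, fig.width=2, fig.align="center"} sets them for a single chunk. The call below sets the same values as defaults for all later chunks, which is a wider scope than the per-chunk header; it is shown only because it can be written as ordinary R code.

```r
library(knitr)

# Assumed values matching the text: 2 inches tall, 2 inches wide, centered.
# opts_chunk$set() changes the defaults for every chunk that follows.
opts_chunk$set(fig.height = 2, fig.width = 2, fig.align = "center")

# Inspect the current defaults.
opts_chunk$get(c("fig.height", "fig.width", "fig.align"))
```

For a one-off figure the per-chunk header is usually the better choice, since it leaves every other figure in the document at its default size.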

Ok, back to fixing the graph. What do you think?

  1. Axes look weird - lots of interpolation required by the consumer.

  2. Data looks “chunky”? How many data points are at each point?

To fix the axis scale we can use scale_y_continuous to choose better labels for the y-axis and we can use scale_x_date to determine how to plot the dates we are plotting. Here we are going to plot at two-week intervals (date_breaks = "2 week") using labels that include the month and date (date_labels = "%b %d").

Note here that we are going to adjust the plot by adding new features to the ggplot object margin_over_time_plot. We could also have created the graph without creating the object.

margin_over_time_plot <- margin_over_time_plot  + 
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "2 week", date_labels = "%b %d") 
margin_over_time_plot
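
To see what these arguments hand to ggplot, it can help to evaluate them on their own. The short sketch below is not part of the original analysis; it prints the break positions produced by seq() and the labels that scales::percent_format(accuracy = 1) generates for them. The object names y_breaks and y_labels are made up for illustration.

```r
library(scales)

# Break positions passed to scale_y_continuous(breaks = ...)
y_breaks <- seq(-.1, .2, by = .05)

# percent_format() returns a function; calling it on the breaks gives the axis labels
y_labels <- percent_format(accuracy = 1)(y_breaks)

y_breaks
y_labels

# date_labels uses strftime-style codes: %b = abbreviated month name, %d = day of month
format(as.Date("2020-10-15"), "%b %d")
```

Changing accuracy (for example to .1) changes how many decimal places the labels show, without touching the underlying data.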

Now one thing that is hard to know is how many polls are at a particular point. If some points contain a single poll and others contain 1000 polls that matters a lot for how we interpret the relationship.

To help convey this information we can again use geom_jitter and alpha transparency instead of geom_point. Here we set height=.005, which adds up to +/- .005 to the y-value of each poll. Because we do not set width, geom_jitter also applies its default amount of horizontal jitter, so the dates are nudged slightly as well. (Note that we could also use position="jitter" when calling geom_point.) This adds just enough error in the x (width) and y (height) values associated with each point to help distinguish how many observations might share a value. We are also going to use the alpha transparency to show when lots of points fall on a similar (jittered) location. Note that in this code we are going to redo the plot from start to finish.

Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = margin)) + 
  labs(title="Margin in 2020 National Popular Vote Polls Over Time",
       y = "Margin: Biden - Trump",
       x = "Poll Ending Date") + 
    geom_jitter(color = "PURPLE",height=.005, alpha = .4) +
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "2 week", date_labels = "%b %d") 
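
If the goal is to leave the poll date untouched and nudge only the margin, one option is to set the horizontal jitter to zero explicitly. The following is a minimal sketch on made-up data (toy_polls and toy_plot are hypothetical names, and the values are not from Pres2020.PV); it uses position_jitter() inside geom_point so that width, height, and a seed can all be set.

```r
library(ggplot2)

# Hypothetical toy data: three polls ending on the same day with the same margin
toy_polls <- data.frame(
  EndDate = as.Date(c("2020-10-01", "2020-10-01", "2020-10-01", "2020-10-08")),
  margin  = c(.05, .05, .05, .10)
)

# width = 0 leaves the dates alone; height = .005 moves each margin by at most +/- .005
# seed makes the random jitter reproducible
toy_plot <- ggplot(toy_polls, aes(x = EndDate, y = margin)) +
  geom_point(position = position_jitter(width = 0, height = .005, seed = 1),
             color = "purple", alpha = .4)

toy_plot

# layer_data() returns the coordinates ggplot actually draws
drawn <- layer_data(toy_plot)
drawn[, c("x", "y")]
```

Comparing drawn$y to toy_polls$margin shows how far each point was moved, which is a quick way to check that the jitter is small relative to the differences you care about.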

In addition to plotting points using geom_point we can also add lines to the plot using geom_line. The line will connect the values in sequence, so it does not always make sense to include. For example, if we add geom_line to the plot it is hard to make the case that the results are meaningful.

Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = margin)) + 
  labs(title="Margin in 2020 National Popular Vote Polls Over Time",
       y = "Margin: Biden - Trump",
       x = "Poll Ending Date") + 
    geom_jitter(color="purple", alpha = .5) + 
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "2 week", date_labels = "%b %d") +
  geom_line()

While geom_line may help accentuate variation over time, it is really not designed to summarize a relationship in the data, as it simply connects sequential points (arranged according to the x-axis). In contrast, if we add geom_smooth then the plot adds a smoothed average of nearby points to help summarize the trend. (Note that we have opted to include the associated uncertainty in the moving average by using the se=T parameter in geom_smooth.)

# Using message=FALSE to suppress note about geom_smooth
Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = margin)) + 
  labs(title="Margin in 2020 National Popular Vote Polls Over Time",
       y = "Margin: Biden - Trump",
       x = "Poll Ending Date") + 
    geom_jitter(color="purple", alpha = .5) + 
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "2 week", date_labels = "%b %d") +
    geom_smooth(color = "BLACK", se=T) 

Plotting Multiple Variables Over Time (Time-Series)

Let’s define a ggplot object to add to later. Note that this code chunk will not print anything because we need to call the object to see what it produced.

BidenTrumpplot <- Pres2020.PV %>%
  ggplot()  +
  geom_point(aes(x = EndDate, y = Trump), 
             color = "red", alpha=.4)  +
  geom_point(aes(x = EndDate, y = Biden), 
             color = "blue", alpha=.4) +
  labs(title="% Biden and Trump in 2020 National Popular Vote Polls Over Time",
       y = "Pct. Support",
       x = "Poll Ending Date") +
  scale_x_date(date_breaks = "2 week", date_labels = "%b %d") + 
  scale_y_continuous(breaks=seq(.3,.7,by=.05),
                     labels= scales::percent_format(accuracy = 1)) 

Now call the plot

BidenTrumpplot

Now we are going to add smoothed lines to the ggplot object. The nice thing about having saved the ggplot object above is that to add the smoothed lines we can simply add them to the existing plot. geom_smooth summarizes the average using a subset of the data (the default span for its loess smoother is 75%): it computes a local summary from the closest 75% of the data. Put differently, after sorting the polls by their date, it generates the predicted value for a poll taken at a given date using the closest 75% of polls according to the date. (It is not taking a mean of those values, but it is doing something similar.) After doing so for the first date, it then does the same for the second date, but now the “closest” data includes polls that happened both before and after. The smoother “moves along” the x-axis and generates a predicted value for each date. Because it is using so much of the data, the values of the line change very slowly: the only change between two adjacent dates comes from dropping some polls and adding others.

When calling the geom_smooth we used se=T to tell ggplot to produce an estimate of how much the prediction may vary. (This is the 95% confidence interval for the prediction being graphed.)

# Using message=FALSE to suppress note about geom_smooth
BidenTrumpplot +  
  geom_smooth(aes(x = EndDate, y = Trump), 
              color = "red",se=T) + 
  geom_smooth(aes(x = EndDate, y = Biden), 
              color = "blue",se=T)

If we instead use only the closest 10% of the data by setting span=.1 we get a slightly different relationship. The smoothed lines are less smooth because the impact of adding and removing points is much greater when we are using less data to compute the smoothed value: a single observation can change the overall results much more than when we used so much more data. In addition, the confidence bands around the smoother are wider because we are using less data to calculate each point.

# Using message=FALSE to suppress note about geom_smooth
BidenTrumpplot +  
  geom_smooth(aes(x = EndDate, y = Trump), 
              color = "red",se=T,span=.1) + 
  geom_smooth(aes(x = EndDate, y = Biden), 
              color = "blue",se=T,span=.1)
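
For a rough sense of what span controls, the sketch below calls stats::loess() directly, which is the smoother geom_smooth uses by default for smaller datasets. The data here are simulated, not the polling data, and the names toy, fit_wide, fit_narrow, and roughness are invented for the illustration.

```r
set.seed(42)

# Simulated "polls": a slow trend over 200 days plus noise (hypothetical data)
toy <- data.frame(day = 1:200)
toy$support <- .50 + .02 * sin(toy$day / 40) + rnorm(200, sd = .02)

# Same data, two spans: 75% of the points per local fit versus 10%
fit_wide   <- loess(support ~ day, data = toy, span = .75)
fit_narrow <- loess(support ~ day, data = toy, span = .10)

# Roughness = sum of squared second differences of the fitted line
# (larger values mean a wigglier line)
roughness <- function(fit) sum(diff(fitted(fit), differences = 2)^2)

roughness(fit_wide)
roughness(fit_narrow)
```

The narrow-span fit chases more of the noise, which is the same behavior the red and blue lines show when span=.1 is passed to geom_smooth.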

Try some other values to see how things change. Note that you are not changing the data, you are only changing how you are visualizing the relationship over time by deciding how much data to use. In part, the decision of how much data to use is a question of how much you want to allow public opinion to vary over the course of the campaign: how much of the variation is “real” versus how much is “noise”?

Quick Exercise Choose another span and see how it changes how you interpret how much variation there is in the data over time.

# INSERT CODE

ADVANCED! We can also use fill to introduce another dimension into the visualization. Consider for example, plotting support for Trump over time for mixed-mode polls versus telephone polls versus online-only polls. How would you go about doing this? You want to be careful not to make a mess however. Just because you can doesn’t mean you should!

State Polls and the Electoral College

New functions and libraries:

First load the data of all state-level polls for the 2020 presidential election.

Pres2020.StatePolls <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Pres2020_StatePolls.Rds")
glimpse(Pres2020.StatePolls)
## Rows: 1,545
## Columns: 19
## $ StartDate      <date> 2020-03-21, 2020-03-24, 2020-03-24, 2020-03-28, 2020-0…
## $ EndDate        <date> 2020-03-30, 2020-04-03, 2020-03-29, 2020-03-29, 2020-0…
## $ DaysinField    <dbl> 10, 11, 6, 2, 3, 5, 2, 2, 7, 3, 3, 3, 2, 2, 3, 4, 10, 1…
## $ MoE            <dbl> 2.8, 3.0, 4.2, NA, 4.0, 1.7, 3.0, 3.1, 4.1, 4.4, NA, NA…
## $ Mode           <chr> "Phone/Online", "Phone/Online", "Live phone - RDD", "Li…
## $ SampleSize     <dbl> 1331, 1000, 813, 962, 602, 3244, 1035, 1019, 583, 500, …
## $ Biden          <dbl> 41, 47, 48, 67, 46, 46, 46, 48, 52, 42, 48, 50, 52, 38,…
## $ Trump          <dbl> 46, 34, 45, 29, 46, 40, 48, 45, 39, 49, 47, 41, 43, 49,…
## $ Winner         <chr> "Rep", "Dem", "Dem", "Dem", "Dem", "Rep", "Dem", "Dem",…
## $ poll.predicted <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Funded         <chr> "UtahPolicy.com & KUTV 2News", "Sacred Heart University…
## $ Conducted      <chr> "Y2 Analytics", "GreatBlue Research", "LHK Partners Inc…
## $ margin         <dbl> -5, 13, 3, 38, 0, 6, -2, 3, 13, -7, 1, 9, 9, -11, 6, -1…
## $ DaysToED       <drtn> 218 days, 214 days, 219 days, 219 days, 216 days, 213 …
## $ StateName      <chr> "Utah", "Connecticut", "Wisconsin", "California", "Mich…
## $ EV             <int> 6, 7, 10, 55, 16, 29, 16, 16, 12, 15, 10, 16, 11, 6, 16…
## $ State          <chr> "UT", "CT", "WI", "CA", "MI", "FL", "GA", "MI", "WA", "…
## $ BidenCertVote  <dbl> 38, 59, 49, 64, 51, 48, 50, 51, 58, 49, 49, 51, 49, 41,…
## $ TrumpCertVote  <dbl> 58, 39, 49, 34, 48, 51, 49, 48, 39, 50, 49, 48, 49, 58,…

Variables of potential interest include:

Overall Task:

Task 1: How should we translate a poll result into a predicted probability?

Suppose that I give you 10 polls from a state.

Load in the data and use mutate to create some new variables.

Pres2020.StatePolls <- Pres2020.StatePolls %>%
   mutate(BidenNorm = Biden/(Biden+Trump),
          TrumpNorm = 1-BidenNorm,
          Biden = Biden/100,
          Trump=Trump/100)

How can you use them to create a probability? Discuss! (I can think of 3 ways.)

  • Measure 1: Fraction of polls with Biden in the lead
  • Measure 2: Biden Pct = Probability Biden wins
  • Measure 3: Normalized Biden Pct = Probability Biden wins (i.e., all voters either vote for Biden or Trump). Sometimes called “two-party” vote.
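
To make the three measures concrete, here is a small sketch that computes each one for a single made-up state. The five polls in toy_state are hypothetical and are not taken from Pres2020.StatePolls; the column names Measure1 to Measure3 are only labels for this illustration.

```r
library(dplyr)

# Five hypothetical polls from one state (percent support for each candidate)
toy_state <- tibble(
  Biden = c(48, 50, 47, 51, 49),
  Trump = c(46, 45, 48, 44, 47)
)

toy_measures <- toy_state %>%
  summarize(
    Measure1 = mean(Biden > Trump),             # share of polls Biden leads
    Measure2 = mean(Biden / 100),               # average Biden share
    Measure3 = mean(Biden / (Biden + Trump))    # average two-party Biden share
  )

toy_measures
```

In this toy state Biden leads four of the five polls (0.8) yet averages only 49% support, so the measures can disagree sharply about the same set of polls.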

How does this vary across states? The joys of group_by(). Note that group_by() affects every subsequent step in that pipeline until the data are summarized or ungrouped. So here we are going to calculate the means separately for each state.

stateprobs <- Pres2020.StatePolls %>%
    group_by(StateName) %>%
      summarize(BidenProbWin1 = mean(Biden > Trump),
                BidenProbWin2 = mean(Biden),  
                BidenProbWin3 = mean(BidenNorm))

stateprobs
## # A tibble: 50 × 4
##    StateName   BidenProbWin1 BidenProbWin2 BidenProbWin3
##    <chr>               <dbl>         <dbl>         <dbl>
##  1 Alabama             0             0.389         0.407
##  2 Alaska              0             0.442         0.466
##  3 Arizona             0.840         0.484         0.519
##  4 Arkansas            0             0.381         0.395
##  5 California          1             0.618         0.661
##  6 Colorado            1             0.534         0.571
##  7 Connecticut         1             0.584         0.631
##  8 Delaware            1             0.603         0.627
##  9 Florida             0.798         0.486         0.517
## 10 Georgia             0.548         0.474         0.504
## # ℹ 40 more rows
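
To make concrete what group_by() changes, the sketch below runs the same summarize() call with and without grouping. The data are made up (two hypothetical states, A and B), and the names toy_polls, overall, and by_state are only for illustration.

```r
library(dplyr)

# Hypothetical polls from two made-up states
toy_polls <- tibble(
  StateName = c("A", "A", "A", "B", "B"),
  Biden     = c(.52, .50, .54, .40, .42)
)

# Without group_by(): one mean over all five polls
overall <- toy_polls %>%
  summarize(BidenAvg = mean(Biden))

# With group_by(): the same summarize() now returns one row per state
by_state <- toy_polls %>%
  group_by(StateName) %>%
  summarize(BidenAvg = mean(Biden))

overall
by_state
```

After summarize() with a single grouping variable the result is no longer grouped, so any later steps operate on the whole tibble again.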

Clearly they differ, so let’s visualize them to try to understand what is going on. To do so, install the plotly library (install.packages("plotly")) if you have not already.

library(plotly)
gg <- stateprobs %>%
  ggplot(aes(x=BidenProbWin2, y=BidenProbWin3,text=paste(StateName))) +
  geom_point() +
  geom_abline(intercept=0,slope=1) +
  labs(x= "Probability as % Support",
       y = "Probability as Two-Party % Support",
       title = "Comparing Probability of Winning Measures")

ggplotly(gg,tooltip = "text")

So removing the undecided and making the probabilities for Biden and Trump sum to 100% is consequential.

What about if we compare these measures to the fraction of polls with a given winner? After all, it seems implausible that Biden would ever lose California or Trump would ever lose Tennessee.

library(plotly)
gg <- stateprobs %>%
  ggplot(aes(x=BidenProbWin2, y=BidenProbWin1,text=paste(StateName))) +
  geom_point() +
  geom_abline(intercept=0,slope=1) +
  labs(x= "Probability as % Support",
       y = "Probability as % Polls Winning",
       title = "Comparing Probability of Winning Measures")

ggplotly(gg,tooltip = "text")

So what do you think? Exactly the same data, but different implications depending on how you choose to measure the probability of winning a state. Data science is as much about argument and reasoning as it is about coding. How we measure a concept is often critical to the conclusions that we get.

Task 2: Start “simple” – calculate the probability that Biden wins PA

But we want to combine these probabilities with the Electoral College votes in each state. Not every state has the same number of Electoral College votes. A state’s total is typically given by the number of Senators (2) plus the number of Representatives (at least 1), so we need to account for this if we want to make a projection about who is going to win the Electoral College.

  • Create a tibble with just polls from PA.
PA.dat <- Pres2020.StatePolls %>% 
  filter(State == "PA")
  • Now compute these three probabilities. What functions do we need?
PA.dat %>%
      summarize(BidenProbWin1 = mean(Biden > Trump),
                BidenProbWin2 = mean(Biden),  
                BidenProbWin3 = mean(BidenNorm))
## # A tibble: 1 × 3
##   BidenProbWin1 BidenProbWin2 BidenProbWin3
##           <dbl>         <dbl>         <dbl>
## 1         0.916         0.499         0.529
  • What do you think about this?

Task 3: Given that probability, how do we change the code to compute the expected number of Electoral College Votes EV for Biden?

  • Keep the code from above and copy and paste so you can understand how each step changes what we are doing. Note that we have the number of electoral college votes associated with each state EV that we want to use to compute the expected number of electoral college votes. But recall that when we summarize we change the tibble to be the output of the function. So how do we keep the number of Electoral College votes for a future mutation?
PA.dat %>%
      summarize(BidenProbWin1 = mean(Biden > Trump),
                BidenProbWin2 = mean(Biden),
                BidenProbWin3 = mean(BidenNorm),
                EV = mean(EV)) %>%
      mutate(BidenEV1 = BidenProbWin1*EV,
             BidenEV2 = BidenProbWin2*EV,
             BidenEV3 = BidenProbWin3*EV)
## # A tibble: 1 × 7
##   BidenProbWin1 BidenProbWin2 BidenProbWin3    EV BidenEV1 BidenEV2 BidenEV3
##           <dbl>         <dbl>         <dbl> <dbl>    <dbl>    <dbl>    <dbl>
## 1         0.916         0.499         0.529    20     18.3     9.98     10.6
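
Using mean(EV) works here because EV is the same on every row for a state, so averaging identical values just returns that value. A sketch of two alternatives that make the intent more explicit follows. The polls in toy_pa are hypothetical; only the value EV = 20 matches the output above, and the names ev_first and ev_grouped are invented for the illustration.

```r
library(dplyr)

# Hypothetical polls for one state; EV repeats on every row
toy_pa <- tibble(
  Biden = c(.50, .49, .51),
  Trump = c(.45, .50, .46),
  EV    = c(20L, 20L, 20L)
)

# Option A: take the first value of EV inside summarize()
ev_first <- toy_pa %>%
  summarize(BidenProbWin1 = mean(Biden > Trump),
            EV = first(EV)) %>%
  mutate(BidenEV1 = BidenProbWin1 * EV)

# Option B: group by EV so it is carried through as a grouping column
ev_grouped <- toy_pa %>%
  group_by(EV) %>%
  summarize(BidenProbWin1 = mean(Biden > Trump)) %>%
  mutate(BidenEV1 = BidenProbWin1 * EV)

ev_first
ev_grouped
```

Both options give the same BidenEV1; which one reads best is largely a matter of taste.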

Note that we are calculating the Expected Value of the Electoral College votes using: Probability that Biden wins state i × Electoral College Votes in state i. This will allocate fractions of Electoral College votes even though the actual election is winner-take-all. This is OK because the fractions reflect the probability that an alternative outcome occurs.
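
One way to see why fractional votes are reasonable is to simulate many winner-take-all elections and average the totals. The sketch below does this for three hypothetical states; the probabilities and vote counts in toy_states are made up, and the simulation draws states independently, which is a simplifying assumption rather than a claim about real elections.

```r
set.seed(2020)

# Three hypothetical states: win probability and Electoral College votes
toy_states <- data.frame(
  State   = c("A", "B", "C"),
  ProbWin = c(.9, .5, .2),
  EV      = c(20, 10, 6)
)

# Expected value: sum over states of probability times votes
expected_ev <- sum(toy_states$ProbWin * toy_states$EV)

# Simulate winner-take-all elections: each state is won or lost outright
# (states are drawn independently here, which is a simplifying assumption)
n_sims <- 10000
sim_totals <- replicate(n_sims, {
  won <- rbinom(nrow(toy_states), size = 1, prob = toy_states$ProbWin)
  sum(won * toy_states$EV)
})

expected_ev        # 24.2
mean(sim_totals)   # close to expected_ev
```

No single simulated election produces 24.2 votes, but the average across simulated elections settles near that number, which is what the expected value describes.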

Quick Exercise How can we compute the expected number of Electoral College votes for Trump in each measure? NOTE: There are at least 2 ways to do this because this is a 2 candidate race.

# INSERT CODE HERE
  • EV-BidenEV, or compute TrumpProbWin

Task 4: Now generalize to every state by applying this code to each set of state polls.

  • What do we need to do this calculation for every state in our tibble?
  • First, compute probability of winning a state. (How?)
  • Second, compute expected Electoral College Votes. (How?)
Pres2020.StatePolls %>%  
  group_by(StateName) %>%
    summarize(BidenProbWin1 = mean(Biden > Trump),
              BidenProbWin3 = mean(BidenNorm),
              EV = mean(EV),
              State = first(State)) %>%
    mutate(BidenECVPredicted1 = EV*BidenProbWin1,
              TrumpECVPredicted1 = EV- BidenECVPredicted1,
              BidenECVPredicted3 = EV*BidenProbWin3,
              TrumpECVPredicted3 = EV- BidenECVPredicted3) %>%
  summarize(BidenECVPredicted1=sum(BidenECVPredicted1),
            BidenECVPredicted3=sum(BidenECVPredicted3),
            TrumpECVPredicted1=sum(TrumpECVPredicted1),
            TrumpECVPredicted3=sum(TrumpECVPredicted3))
## # A tibble: 1 × 4
##   BidenECVPredicted1 BidenECVPredicted3 TrumpECVPredicted1 TrumpECVPredicted3
##                <dbl>              <dbl>              <dbl>              <dbl>
## 1               345.               289.               190.               246.

Task 5: Now compute total expected vote by adding to that code

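  • NOTE: The actual result was 306 (Biden) to 232 (Trump). The predicted totals sum to 535 rather than 538 because only the 50 states appear in the grouped tibble.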
  • What do we need to do to the tibble we created in Task 4 to get the overall number of Electoral College Votes?
Pres2020.StatePolls %>%  
  group_by(StateName) %>%
    summarize(BidenProbWin1 = mean(Biden > Trump),
              BidenProbWin3 = mean(BidenNorm),
              EV = mean(EV)) %>%
    mutate(BidenECVPredicted1 = EV*BidenProbWin1,
              TrumpECVPredicted1 = EV- BidenECVPredicted1,
              BidenECVPredicted3 = EV*BidenProbWin3,
              TrumpECVPredicted3 = EV- BidenECVPredicted3) %>%
    summarize(BidenECV1 = sum(BidenECVPredicted1),
              TrumpECV1 = sum(TrumpECVPredicted1),
              BidenECV3 = sum(BidenECVPredicted3),
              TrumpECV3 = sum(TrumpECVPredicted3))
## # A tibble: 1 × 4
##   BidenECV1 TrumpECV1 BidenECV3 TrumpECV3
##       <dbl>     <dbl>     <dbl>     <dbl>
## 1      345.      190.      289.      246.

Quick Exercise We could also do this using only the polls conducted in the last 7 days. How?

# INSERT CODE HERE

THINKING: What about states that do not have any polls? What should we do about them? Is there a reason why they might not have a poll? Is that useful information? Questions like this become more relevant when we start to restrict the sample.

Here is the number of polls done in each state in the last 3 days. Note that when we use fewer days, each state has fewer polls, so our measure based on the percentage of polls won may be more affected by any single poll.

Pres2020.StatePolls %>%
  filter(DaysToED < 3) %>%
  count(State) %>%
  ggplot(aes(x=n)) +
  geom_bar() + 
  scale_x_continuous(breaks=seq(0,15,by=1)) +
  labs(x="Number of Polls in a State",
       y="Number of States",
       title="Number of Polls in States \n in the Last 3 Days of 2020")
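
One thing to keep in mind when reading this plot: count(State) only returns states that have at least one row after the filter, so states with no recent polls do not show up as a bar at zero. The sketch below shows one possible way to keep them, assuming State is stored as a factor whose levels list every state of interest. The data in toy_polls are hypothetical; .drop = FALSE is the dplyr::count argument that keeps empty factor levels, and the difftime is converted to a number of days explicitly so the units of the comparison are clear.

```r
library(dplyr)

# Hypothetical polls; UT has no poll in the last 3 days
toy_polls <- tibble(
  State    = factor(c("PA", "PA", "WI", "UT"), levels = c("PA", "UT", "WI")),
  DaysToED = as.difftime(c(1, 2, 2, 30), units = "days")
)

recent_counts <- toy_polls %>%
  # compare on an explicit numeric number of days
  filter(as.numeric(DaysToED, units = "days") < 3) %>%
  # .drop = FALSE keeps factor levels that have zero rows after the filter
  count(State, .drop = FALSE)

recent_counts
```

With the zero-count states retained, a bar chart of n would include a bar at 0, which speaks directly to the question of what to do about states without polls.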